ABSTRACT

Information Theory is applicable to a number of fields.
The basic statistic of Information Theory, $-\sum_i p_i \log p_i$, is
derived for the discrete case using Bayes's rule for the
probability of causes. Various properties of this function
are derived and discussed.

The writer wishes to express his appreciation for the
assistance and encouragement given him by Professor
Randolph Church of the U. S. Naval Postgraduate School in

1. Introduction

In 1948 Claude Shannon published his now famous paper
entitled "A Mathematical Theory of Communication" [11],
later reprinted with a paper by Warren Weaver as reference
[12], in which he defined a communication system as being
composed of an information source, a transmitter, a channel,
a receiver, and the destination. The fundamental problem in
such a system is the reproduction at the destination, either
exactly or approximately, of a message selected at the
information source. His approach was statistical in nature, that is,
he did not consider the semantic aspects of the message but
rather that the message is one of a set of possible messages.
Among other things he showed that under certain conditions it
is possible to encode the transmitted information so that it
would be received with an arbitrarily small frequency of error.

Basic to his arguments is a concept known in thermodynamics
as entropy, which can be described as a measure of the
amount of disorder in a physical system. This implies that a
message, in some sense, represents a certain amount of disorder
and that there is a measure of this disorder which can
be obtained and used.

Kullback [7] recently published a book in which he
uses the same basic concept to provide a unifying background
for the study of the testing of statistical hypotheses.
Bagno [1] uses Shannon's theorems to arrive at some startling
conclusions relative to economic theory. Miller's [9]
article presents a short discussion on the use of these concepts
in the field of psychology. Thus Information Theory
apparently has a wide field of applications.

What is information? One dictionary¹ lists among its
several definitions "... knowledge communicated or received
concerning some fact ...". We say that knowledge is certainty
and that Information Theory is the study of the ability of
systems to transmit certainty or, equivalently, of the change
in an observer's state of uncertainty when he has performed
experiments on a situation and drawn conclusions (gained
knowledge) from the results.

In the following pages we will develop the basic statistic
of Information Theory and show that it is a logical
and appealing measure of information in the above sense. The
basis of the development will be Probability Theory and
specifically an application of Bayes's rule.


¹ The New Century Dictionary, Appleton-Century-Crofts, 1948.


2. Bayes's Rule and Information

Let us perform a simple experiment and see if there is
a pattern or characteristic which can be exploited to obtain
a statistic which relates a situation prior to one or more
observations to the situation after the observations.

We can represent the situation by $A$, where $A$ is composed
of $m$ mutually exclusive and exhaustive events, $a_1, a_2, \ldots, a_m$;
the outcomes of an observation by $B$, where $B$ is composed of $n$
mutually exclusive and exhaustive events $b_1, b_2, \ldots, b_n$. An
event $a_i$ occurs with probability $p(a_i)$, $0 \le p(a_i) \le 1$,
$\sum_i p(a_i) = 1$; an event $b_j$ occurs with probability $p(b_j)$,
$0 \le p(b_j) \le 1$, $\sum_j p(b_j) = 1$.

Consider that we have been given two nickels, told that
one is fair and the other biased so the probability of heads
appearing when tossed is 1/4, and asked to determine which is
the fair nickel. Let the experiment consist of choosing one
of the two nickels at random, tossing it twice and noting
the outcome of both tosses. Based on the results of this
experiment we are to state the probability that the chosen
nickel is fair.

Let $a_1$ denote the event: the fair nickel is chosen, and
$a_2$ denote the event: the biased nickel is chosen. Let $b_1$
denote the event: the nickel comes up heads, and $b_2$ denote
the event: the nickel comes up tails.


Since we make a random choice, the probability that the
fair nickel is chosen, $p(a_1)$, is equal to 1/2 and the probability
that the biased nickel is chosen, $p(a_2)$, is also equal
to 1/2.

The conditional probability of a specific outcome will
be denoted by $p(b_j/a_i)$ where

$$p(b_1/a_1) = 1/2,$$
$$p(b_2/a_1) = 1/2,$$
$$p(b_1/a_2) = 1/4,$$

and

$$p(b_2/a_2) = 3/4.$$

Considering now the first toss of the coin, from the
definition of conditional probability¹ we can write

$$p(a_i b_j) = p(b_j/a_i)\, p(a_i) \tag{1}$$

where $a_i b_j$ denotes the joint occurrence of event $a_i$ and
outcome $b_j$. When the $a_i$ are mutually exclusive and exhaustive
events,

$$p(b_j) = \sum_i p(b_j/a_i)\, p(a_i). \tag{2}$$

The conditional probability of event $a_i$ is the ratio of the
probability of the joint occurrence of both events to the
probability of the outcome, or in symbols,

$$p(a_i/b_j) = \frac{p(a_i b_j)}{p(b_j)}. \tag{3}$$

¹ W. Feller, An Introduction to Probability Theory and Its Applications,
Second Edition, John Wiley & Sons, Inc., Ch. V, 1957.


Substituting (1) and (2) into (3),

$$p(a_i/b_j) = \frac{p(b_j/a_i)\, p(a_i)}{\sum_i p(b_j/a_i)\, p(a_i)}, \tag{4}$$

which is Bayes's rule for the probability of causes¹ if we
identify the event $a_i$ with the cause and outcome $b_j$ with the
effect.

Using the given values we can calculate the conditional
probability of $a_i$, which in the context of Bayes's rule is
known as the aposteriori probability of event $a_i$, as contrasted
with $p(a_i)$, the apriori probability of event $a_i$.


Displayed in tabular form, the values of $p(a_i/b_j)$ are

              i = 1     i = 2
    j = 1      2/3       1/3
    j = 2      2/5       3/5          (5)

Thus after one toss of the nickel, the aposteriori probability
of having chosen the fair nickel is 2/3 if we had observed
heads, and 2/5 if we had observed tails. If the sequence,
selection and toss, was repeated a large number of
times, the aposteriori probabilities represent the frequency
with which the observer would be correct if he associated a
given choice with a given outcome.

¹ loc. cit.
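The single-toss figures can be checked numerically. The following is a minimal sketch in Python, assuming only the values stated in the text (each nickel chosen with probability 1/2, heads with probability 1/2 for the fair nickel and 1/4 for the biased one). The names `PRIOR`, `LIKELIHOOD` and `posterior` are illustrative and are not part of the original development.

```python
from fractions import Fraction as Fr

# Assumed from the text: a_1 = fair nickel, a_2 = biased nickel;
# b_1 = heads, b_2 = tails.
PRIOR = {1: Fr(1, 2), 2: Fr(1, 2)}        # p(a_i)
LIKELIHOOD = {                            # p(b_j / a_i), keyed by (j, i)
    (1, 1): Fr(1, 2), (2, 1): Fr(1, 2),
    (1, 2): Fr(1, 4), (2, 2): Fr(3, 4),
}

def posterior(i, j):
    """Bayes's rule: p(a_i / b_j) from the prior and the likelihoods."""
    # Probability of the outcome b_j, summed over both possible causes.
    evidence = sum(LIKELIHOOD[(j, c)] * PRIOR[c] for c in PRIOR)
    return LIKELIHOOD[(j, i)] * PRIOR[i] / evidence

for j in (1, 2):
    print("j =", j, [str(posterior(i, j)) for i in (1, 2)])
```

Exact fractions are used so that the results can be compared directly with the tabulated values 2/3, 1/3, 2/5 and 3/5.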

We will now toss the coin for the second time. Defining
the joint outcome $b_j b_k$, $j, k = 1, 2$, as the pair of observations
made in the two tosses, we can extend (4) to

$$p(a_i/b_j b_k) = \frac{p(b_j b_k/a_i)\, p(a_i)}{\sum_i p(b_j b_k/a_i)\, p(a_i)}. \tag{6}$$


The conditional probability of the outcome of a single toss
remains the same once a coin is chosen, therefore

$$p(b_k/a_i) = p(b_j/a_i) \quad \text{for } j = k. \tag{7}$$

Since the tosses are independent,

$$p(b_j b_k/a_i) = p(b_j/a_i)\, p(b_k/a_i). \tag{8}$$

Substituting (8) into (6),

$$p(a_i/b_j b_k) = \frac{p(b_j/a_i)\, p(b_k/a_i)\, p(a_i)}{\sum_i p(b_j/a_i)\, p(b_k/a_i)\, p(a_i)}. \tag{9}$$


We are now in a position to calculate the aposteriori probability
of event $a_i$ after two observations. Again using the
given values and exhibiting the results in a table of $p(a_i/b_j b_k)$,

              i = 1     i = 2
    j  k
    1  1       4/5       1/5
    1  2       4/7       3/7
    2  1       4/7       3/7
    2  2       4/13      9/13         (10)


Thus if after two tosses we had observed heads-heads, the
aposteriori probability that the coin is fair would be 4/5.

Notice that for each additional toss the size of the
table required to describe the possible outcomes is doubled.
If instead of this almost trivial example we had set $m$ and $n$
equal to 50, the table required to describe the situation
would be enormous. It would be desirable to have a simpler
way of presenting this data.
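The two-toss table and its doubling with each further toss can be reproduced with a short sketch. It assumes the same nickel values as the text and independent tosses; the helper names `likelihood`, `posterior_fair` and `table` are hypothetical.

```python
from fractions import Fraction as Fr
from itertools import product

# Assumed from the text: equal prior, heads probability 1/2 (fair), 1/4 (biased).
PRIOR = {"fair": Fr(1, 2), "biased": Fr(1, 2)}
P_HEADS = {"fair": Fr(1, 2), "biased": Fr(1, 4)}

def likelihood(coin, tosses):
    """Probability of a toss sequence given the coin, tosses being independent."""
    result = Fr(1)
    for t in tosses:
        result *= P_HEADS[coin] if t == "H" else 1 - P_HEADS[coin]
    return result

def posterior_fair(tosses):
    """Aposteriori probability that the fair nickel was chosen."""
    joint = {c: likelihood(c, tosses) * PRIOR[c] for c in PRIOR}
    return joint["fair"] / sum(joint.values())

def table(n):
    """One row per possible sequence of n tosses, 2**n rows in all."""
    return {"".join(seq): posterior_fair(seq) for seq in product("HT", repeat=n)}

for seq, p in table(2).items():
    print(seq, p)
print("rows for 10 tosses:", len(table(10)))
```

With only two possible outcomes per toss the table already has 2**n rows after n tosses, which is the growth the text objects to.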

The denominator in (9) can be written in a manner analogous
to (2) as $p(b_j b_k)$. Multiplying (9) by

$$\frac{\sum_i p(b_j/a_i)\, p(a_i)}{\sum_i p(b_j/a_i)\, p(a_i)}$$

and rearranging,

$$p(a_i/b_j b_k) = \frac{p(b_k/a_i) \sum_i p(b_j/a_i)\, p(a_i)}{\sum_i p(b_j/a_i)\, p(b_k/a_i)\, p(a_i)} \cdot \frac{p(b_j/a_i)\, p(a_i)}{\sum_i p(b_j/a_i)\, p(a_i)}. \tag{11}$$

Simplifying the summations,

$$p(a_i/b_j b_k) = \frac{p(b_k/a_i)\, p(b_j)}{p(b_j b_k)} \cdot \frac{p(b_j/a_i)}{p(b_j)} \cdot p(a_i). \tag{12}$$


Notice that the expression for the aposteriori probability
of event $a_i$ after the first observation, (4), appears on the
right side of (11) and is multiplied by a fraction whose value
is determined by the outcome of the second observation.
Thus we can say that the aposteriori probability of event
$a_i$ after the first observation becomes the apriori probability
of event $a_i$ before the second observation. If we represent
the fraction $\frac{p(b_k/a_i)\, p(b_j)}{p(b_j b_k)}$ in (12) by $F_2$, and the
fraction $\frac{p(b_j/a_i)}{p(b_j)}$ in (3) and (12) by $F_1$, we can write (3) as

$$p(a_i/b_j) = F_1\, p(a_i) \tag{13}$$

and (12) as

$$p(a_i/b_j b_k) = F_1 F_2\, p(a_i). \tag{14}$$

It is apparent that this could be extended to any number of
observations. The aposteriori probability after, say, $n$ observations
is the product of $n$ $F$ factors and the apriori probability
for the first observation. This then is the pattern
which we will use.
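The pattern of successive F factors can be illustrated as follows. This is a sketch only, assuming the nickel values of the example; each factor is computed as the ratio of the likelihood of the new outcome to its probability under the current belief, and the function names are hypothetical.

```python
from fractions import Fraction as Fr

# Assumed from the text: equal prior, heads probability 1/2 (fair), 1/4 (biased).
PRIOR = {"fair": Fr(1, 2), "biased": Fr(1, 2)}
P_HEADS = {"fair": Fr(1, 2), "biased": Fr(1, 4)}

def p_toss(coin, toss):
    """Probability of a single outcome ("H" or "T") given the coin."""
    return P_HEADS[coin] if toss == "H" else 1 - P_HEADS[coin]

def f_factors(coin, tosses):
    """Return [F_1, F_2, ...] for the hypothesis `coin`."""
    belief = dict(PRIOR)                  # current apriori probabilities
    factors = []
    for toss in tosses:
        evidence = sum(p_toss(c, toss) * belief[c] for c in belief)
        factors.append(p_toss(coin, toss) / evidence)
        # The aposteriori probabilities become the next apriori probabilities.
        belief = {c: p_toss(c, toss) * belief[c] / evidence for c in belief}
    return factors

factors = f_factors("fair", "HH")
running = PRIOR["fair"]
for f in factors:
    running *= f                          # apriori probability times F_1 F_2 ...
print([str(f) for f in factors], running)
```

For heads-heads the factors come out as 4/3 and 6/5, and their product with the apriori probability 1/2 is 4/5, in agreement with the two-toss table.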

We will define information as a statistic which measures
the change in an observer's belief as to which event $a_i$,
from a set of mutually exclusive and exhaustive events $A$, is
the cause of outcome $b_j$ from a mutually exclusive and exhaustive
set of outcomes $B$, where $\sum_i p(a_i) = 1$, $\sum_j p(b_j) = 1$,
and $0 \le p(a_i), p(b_j) \le 1$. We require the following mathematical
properties of this statistic: a. additivity, and
b. dependence on apriori and aposteriori probabilities. Additivity
requires that the total amount of information obtained
from a sequence of observations is the sum of the
amounts obtained from the individual observations.

The $F_r$ defined above do relate to apriori and aposteriori
probabilities. The requirement of additivity can be
met by use of the function $\log F$, for

$$\log (F_1 F_2 \cdots F_n) = \log F_1 + \log F_2 + \cdots + \log F_n. \tag{15}$$

Shannon [11, 12] uses base two logarithms, defining the
unit of information as a bit, a contraction of binary digit.
One bit of information corresponds to being informed of the
outcome of a binary equally likely selection, a unit which
is convenient since a relay or flip flop circuit can store
one bit of information. Kullback [7] uses natural logarithms
since his work involves integration and differentiation,
and others have used base 10.
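The effect of the choice of base can be seen in a small sketch. It assumes the two F factors 4/3 and 6/5 that the nickel example yields for the fair nickel after heads-heads; the function name `information` is hypothetical.

```python
import math

# Assumed values: F_1 and F_2 for the fair nickel after heads-heads.
F_FAIR_HH = [4 / 3, 6 / 5]

def information(factors, base=2.0):
    """Sum of the logarithms of the F factors in the given base."""
    return sum(math.log(f, base) for f in factors)

bits = information(F_FAIR_HH)            # base two: the unit is the bit
nats = information(F_FAIR_HH, math.e)    # natural logarithms
print(bits, nats, bits * math.log(2))
```

The sum of the logarithms equals the logarithm of the product 8/5, and the two units differ only by the constant factor log 2.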


3. Uncertainty

We have indicated that a measure of the total change in
belief of an observer is the sum of the logarithms of the $F$
factors. Let us now look at the information obtained from
the first observation of a sequence, designated as $I_{a_i b_j}$.
From the definition of $F_1$,

$$I_{a_i b_j} = \log \frac{p(b_j/a_i)}{p(b_j)}, \tag{16}$$

and from (1) and (3) we see that we can also represent $I_{a_i b_j}$ as

$$I_{a_i b_j} = \log \frac{p(a_i/b_j)}{p(a_i)}, \tag{17}$$

and

$$I_{a_i b_j} = \log \frac{p(a_i b_j)}{p(a_i)\, p(b_j)}. \tag{18}$$

Confining our attention to (17) for the moment, if $p(a_i/b_j)$ is
greater than $p(a_i)$, the information is positive; if less
than $p(a_i)$, the information is negative. Positive information
corresponds to an increase in certainty. If $p(a_i/b_j)$
was equal to one, we would be certain that $a_i$ was the cause
of our observed outcome. We will define the uncertainty of
event $a_i$ as the value of $I_{a_i b_j}$ when $p(a_i/b_j) = 1$; thus the
uncertainty of event $a_i$ is $-\log p(a_i)$. This is the maximum
amount of information which can be obtained concerning $a_i$ in
one observation. We are more concerned however with the entire
set $A$, so let us first obtain the average uncertainty
which we will denote by $H(A)$.

$$H(A) = -\sum_i p(a_i) \log p(a_i). \tag{19}$$

$H(A)$ is descriptive of set $A$ or of any other set with the
same number of events and the same probability distribution.
We can also speak of the average uncertainty of observation
set $B$, and the event-outcome pair set $AB$, as

$$H(B) = -\sum_j p(b_j) \log p(b_j) \tag{20}$$

and

$$H(AB) = -\sum_i \sum_j p(a_i b_j) \log p(a_i b_j). \tag{21}$$
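As an illustration, the three average uncertainties can be evaluated for the nickel experiment. The sketch below assumes base two logarithms and the probabilities of the example; the function `H` and the dictionary names are illustrative only.

```python
import math

# Assumed from the nickel example: p(a_i) and p(b_j / a_i).
P_A = {"fair": 0.5, "biased": 0.5}
P_B_GIVEN_A = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.25, "T": 0.75}}

def H(probs):
    """Average uncertainty in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint probabilities p(a_i b_j) and outcome probabilities p(b_j).
joint = {(a, b): P_A[a] * P_B_GIVEN_A[a][b] for a in P_A for b in "HT"}
p_b = {b: sum(joint[(a, b)] for a in P_A) for b in "HT"}

H_A, H_B, H_AB = H(P_A.values()), H(p_b.values()), H(joint.values())
print(H_A, round(H_B, 4), round(H_AB, 4))
```

Here the outcome probabilities are 3/8 for heads and 5/8 for tails, so the uncertainty of the outcome set is slightly less than the one bit of uncertainty in the choice of nickel.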
Consider now the average information obtained on a particular
event, $a_i$, if we average over all possible outcomes
of the first observation. We will designate this average as
$I_{a_i B_1}$.

$$I_{a_i B_1} = \sum_j p(b_j/a_i)\, I_{a_i b_j} = \sum_j p(b_j/a_i) \log \frac{p(b_j/a_i)}{p(b_j)}. \tag{22}$$

This average is obviously equal to zero when $p(b_j/a_i)$
is equal to $p(b_j)$; recalling from (2) that $p(b_j)$ is equal to
$\sum_i p(b_j/a_i)\, p(a_i)$, we see that this is only possible when
$p(a_i)$ is equal to unity, or when all the $p(b_j/a_i)$ are equal
for a given $j$. In the first case we would obtain no information
since $a_i$ is the only event which can occur, and in the
second case we would obtain no information since $b_j$ is independent
of $a_i$. We will show that in all other cases $I_{a_i B_1}$
is greater than zero. An inequality from Feinstein [4] can
be written as

$$\sum_i q_i \log p_i \le \sum_i q_i \log q_i \tag{23}$$

for $0 \le q_i, p_i \le 1$, $\sum_i p_i = 1$, $\sum_i q_i = 1$, $i = 1, 2, \ldots, n$, with
equality only when $p_i = q_i$. Separating the logarithm in (22),

$$I_{a_i B_1} = \sum_j p(b_j/a_i) \log p(b_j/a_i) - \sum_j p(b_j/a_i) \log p(b_j), \tag{24}$$


and identifying $p(b_j/a_i)$ with $q_i$, $p(b_j)$ with $p_i$, we see that
the second term on the right is always less than or equal to
the first term. Now since the logarithms of numbers less
than one are negative, the difference is always greater than
or equal to zero. Therefore the average information from
one observation on a single event is a positive quantity
corresponding to a decrease in uncertainty, or is zero corresponding
to no change. Thus

$$I_{a_i B_1} \ge 0. \tag{25}$$
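The nonnegativity of the average information can be checked on the nickel experiment. The following sketch assumes base two logarithms and the example's probabilities, and averages log [p(b_j/a_i) / p(b_j)] over the outcomes for each nickel; the name `average_information` is hypothetical.

```python
import math

# Assumed from the nickel example: p(a_i) and p(b_j / a_i).
P_A = {"fair": 0.5, "biased": 0.5}
P_B_GIVEN_A = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.25, "T": 0.75}}

# Outcome probabilities p(b_j), summed over both nickels.
P_B = {b: sum(P_A[a] * P_B_GIVEN_A[a][b] for a in P_A) for b in "HT"}

def average_information(a):
    """Average over outcomes of log2 [p(b_j/a_i) / p(b_j)], in bits."""
    return sum(q * math.log2(q / P_B[b])
               for b, q in P_B_GIVEN_A[a].items() if q > 0)

for a in P_A:
    print(a, round(average_information(a), 4))
```

Individual terms of the sum may be negative (tails is less likely under the fair nickel than overall), yet each average is positive, as the inequality requires.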

We will now show that the average information obtained
from a sequence of observations on the same event cannot exceed
the uncertainty of the event. From (3) and (12) we can
write $F_2$ as

$$F_2 = \frac{p(a_i/b_j b_k)}{p(a_i/b_j)}, \tag{26}$$

and the average information obtained from the second observation
as

$$I_{a_i B_2} = \sum_j \sum_k p(b_k/a_i)\, p(b_j/a_i) \log \frac{p(a_i/b_j b_k)}{p(a_i/b_j)}. \tag{27}$$

By an argument identical to that used to prove (25) this
average is also greater than or equal to zero. Separating
the logarithms in $I_{a_i B_1}$ and $I_{a_i B_2}$ we can write the sum as

$$I_{a_i B_1} + I_{a_i B_2} = \sum_j p(b_j/a_i) \log p(a_i/b_j) - \sum_j p(b_j/a_i) \log p(a_i)$$
$$\qquad + \sum_j \sum_k p(b_k/a_i)\, p(b_j/a_i) \log p(a_i/b_j b_k) - \sum_j \sum_k p(b_k/a_i)\, p(b_j/a_i) \log p(a_i/b_j). \tag{28}$$

If we now sum the last term on the right over $k$, we see that
the result is identical to the first term, thus both terms
cancel out, leaving us

$$I_{a_i B_1} + I_{a_i B_2} = \sum_j \sum_k p(b_k/a_i)\, p(b_j/a_i) \log p(a_i/b_j b_k) - \sum_j p(b_j/a_i) \log p(a_i). \tag{29}$$

Similarly, in a sequence of say $n$ observations, the first
term in the expression for the average information from one
observation in the sequence will cancel out with the second
term in the expression for the average information from the
next observation in the sequence. The sum will therefore
consist of two terms such as

$$\sum_{r=1}^{n} I_{a_i B_r} = \sum_j \sum_k \cdots \sum_m p(b_j b_k \cdots b_m/a_i) \log p(a_i/b_j b_k \cdots b_m) - \sum_j p(b_j/a_i) \log p(a_i), \tag{30}$$

where $b_j$ is the first outcome and $b_m$ the last outcome. Now
the maximum value which the first term on the right can attain
is zero, corresponding to $p(a_i/b_j b_k \cdots b_m) = 1$, and the
second term is just the uncertainty of event $a_i$. Thus

$$\max \sum_r I_{a_i B_r} = -\log p(a_i). \tag{31}$$

Returning to the case of a single observation, let us
average $I_{a_i B_1}$ over all events $a_i$ and see what the overall
average, that is, over both events and outcomes, looks like.
By (18) we can write (22) as

$$I_{a_i B_1} = \sum_j p(b_j/a_i) \log \frac{p(a_i b_j)}{p(a_i)\, p(b_j)}. \tag{32}$$

The average of (32) over all events, designated by $I_1$, is

$$I_1 = \sum_i p(a_i) \sum_j p(b_j/a_i) \log \frac{p(a_i b_j)}{p(a_i)\, p(b_j)}. \tag{33}$$

By (1) and (3) and separating the logarithms,

$$I_1 = \sum_i \sum_j p(a_i b_j) \left[ \log p(a_i b_j) - \log p(a_i) - \log p(b_j) \right]. \tag{34}$$
By (19), (20) and (21) we see that

$$I_1 = -H(AB) + H(A) + H(B). \tag{35}$$

Since we have established that $I_{a_i B_1}$ is nonnegative, $I_1$ must
also be nonnegative. This implies that

$$H(AB) \le H(A) + H(B), \tag{36}$$

with equality when $A$ and $B$ are independent, or one or both
consist of one event with probability one. We can also write
(33) in the form

$$I_1 = \sum_i p(a_i) \left[ \sum_j p(b_j/a_i) \log p(b_j/a_i) - \sum_j p(b_j/a_i) \log p(b_j) \right]. \tag{37}$$


The first term inside the brackets is similar in form to
(19), (20) and (21) and we will define the uncertainty of
outcome set $B$, given a specific event $a_i$, as

$$H(B/a_i) = -\sum_j p(b_j/a_i) \log p(b_j/a_i). \tag{38}$$

Rearranging the right side of (37), and substituting (38),

$$I_1 = -\sum_i p(a_i)\, H(B/a_i) - \sum_i \sum_j p(a_i b_j) \log p(b_j). \tag{39}$$

Now the first term on the right is the negative of the average
conditional uncertainty of $B$ and the second term is the
uncertainty of $B$; thus, denoting the average conditional
uncertainty by $H(B/A)$,

$$I_1 = -H(B/A) + H(B). \tag{40}$$

Since $I_1$ is nonnegative,

$$H(B/A) \le H(B), \tag{41}$$

with equality as in (36).
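The two expressions for the overall average information can be compared numerically. This sketch assumes the nickel probabilities and base two logarithms; `I_joint` and `I_conditional` are illustrative names for the value computed from the joint uncertainty and from the average conditional uncertainty respectively.

```python
import math

# Assumed from the nickel example: p(a_i) and p(b_j / a_i).
P_A = {"fair": 0.5, "biased": 0.5}
P_B_GIVEN_A = {"fair": {"H": 0.5, "T": 0.5}, "biased": {"H": 0.25, "T": 0.75}}

def H(probs):
    """Average uncertainty in bits; terms with p = 0 contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

joint = {(a, b): P_A[a] * P_B_GIVEN_A[a][b] for a in P_A for b in "HT"}
p_b = {b: sum(joint[(a, b)] for a in P_A) for b in "HT"}

H_A, H_B, H_AB = H(P_A.values()), H(p_b.values()), H(joint.values())
# Average conditional uncertainty of B: H(B/a_i) weighted by p(a_i).
H_B_given_A = sum(P_A[a] * H(P_B_GIVEN_A[a].values()) for a in P_A)

I_joint = -H_AB + H_A + H_B
I_conditional = -H_B_given_A + H_B
print(round(I_joint, 6), round(I_conditional, 6))
```

Both forms give the same small positive value for this example, and the conditional uncertainty of the outcome set is correspondingly a little smaller than its unconditional uncertainty.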


(16) and (17) are symmetric expressions in $a_i$ and $b_j$, so
an equivalent averaging of (17) would yield

$$I_1 = -H(A/B) + H(A) \tag{42}$$

and

$$H(A/B) \le H(A). \tag{43}$$
4. Properties of the Uncertainty Statistic

A number of properties were obtained in the previous
section, namely:

$$H(AB) \le H(A) + H(B), \tag{36}$$

$$H(B/A) \le H(B), \tag{41}$$

and

$$H(A/B) \le H(A). \tag{43}$$

We will now consider this function in its own right and
obtain some additional properties. Subtracting (40) from (35)
we find that

$$H(AB) = H(A) + H(B/A), \tag{44}$$

and subtracting (42) from (35),

$$H(AB) = H(B) + H(A/B). \tag{45}$$

Since $H$ is of the form $-\sum_i p_i \log p_i$ we may represent it
as $H(p_1, p_2, \ldots, p_m)$. By the property of the logarithm,

$$\lim_{x \to 0} x \log x = 0,$$

$$H(p_1, p_2, \ldots, p_m, 0) = H(p_1, p_2, \ldots, p_m), \tag{46}$$

which indicates that adding an impossible event to a set does
not change the average uncertainty.

Let us now determine the probability distribution for
which the $H$ function takes on its maximum value. Using the
method of Lagrange multipliers¹ and writing $H$ as a function
of all its arguments, we form

$$W = H(p_1, p_2, \ldots, p_m) + \lambda \left[ \sum_i p_i - 1 \right], \tag{47}$$

or

$$W = -\sum_i p_i \log p_i + \lambda \left[ \sum_i p_i - 1 \right]. \tag{48}$$

¹ W. Kaplan, Advanced Calculus, Addison-Wesley Publishing Co.,
pp. 128-129, 1953.

Taking partial derivatives of $W$ with respect to $p_i$,

$$\frac{\partial W}{\partial p_i} = -p_i \frac{\partial}{\partial p_i} \log p_i - \log p_i + \lambda \quad \text{for } i = 1, 2, \ldots, m. \tag{49}$$

Setting this equal to zero to obtain the extreme point and
solving for $\log p_i$,

$$\log p_i = \lambda - 1, \tag{50}$$

$$p_i = \exp(\lambda - 1). \tag{51}$$

Summing both sides of (51) on $i$ and solving for $\exp(\lambda - 1)$,

$$\exp(\lambda - 1) = 1/m. \tag{52}$$

Substituting (52) into (51),

$$p_i = 1/m. \tag{53}$$

Thus the probability distribution for which the average uncertainty
is the maximum is the uniform distribution, or

$$H(p_1, p_2, \ldots, p_m) \le H(1/m, 1/m, \ldots, 1/m) \tag{54}$$

with equality when $p_i = 1/m$.
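A numerical spot check of the maximum, and of the fact that an impossible event leaves the average uncertainty unchanged, can be sketched as follows. The random trial distributions, the seed and the choice m = 5 are assumptions made only for illustration, and a spot check of this kind is not a proof.

```python
import math
import random

def H(probs):
    """Average uncertainty with natural logarithms; p = 0 contributes nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def random_distribution(m, rng):
    """A hypothetical helper: m random weights normalised to sum to one."""
    weights = [rng.random() for _ in range(m)]
    total = sum(weights)
    return [w / total for w in weights]

rng = random.Random(0)                   # fixed seed, so the run is repeatable
m = 5
h_uniform = H([1 / m] * m)               # uncertainty of the uniform distribution
smallest_gap = min(h_uniform - H(random_distribution(m, rng))
                   for _ in range(1000))

print(h_uniform, math.log(m), smallest_gap)
print(H([0.5, 0.5, 0.0]) == H([0.5, 0.5]))
```

The uniform value equals log m, and none of the trial distributions exceeds it.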

If we have two sets, one with $m$ events, the other with
$m + 1$ events, the maximum average uncertainty of the smaller
set is less than the maximum average uncertainty of the
larger one. This can be seen by computing and noting the
right side of (54) for both cases,

$$-\log \frac{1}{m} < -\log \frac{1}{m+1} \quad \text{for } m \ge 1. \tag{55}$$

Thus

$$H\left(\frac{1}{m}, \frac{1}{m}, \ldots, \frac{1}{m}\right) < H\left(\frac{1}{m+1}, \frac{1}{m+1}, \ldots, \frac{1}{m+1}\right). \tag{56}$$

Consider the form of $H$ when one event, say $m$, is composed
of two sub-events such that $p_m = q_1 + q_2$.

$$H(p_1, p_2, \ldots, p_{m-1}, q_1, q_2) = -\sum_{i=1}^{m-1} p_i \log p_i - \sum_{l=1}^{2} q_l \log q_l \tag{57}$$

and

$$H(p_1, p_2, \ldots, p_m) = -\sum_{i=1}^{m-1} p_i \log p_i - p_m \log p_m. \tag{58}$$

Subtracting (58) from (57) and combining the logarithms of
the terms on the extreme right,

$$H(p_1, p_2, \ldots, p_{m-1}, q_1, q_2) = H(p_1, p_2, \ldots, p_m) + p_m\, H\left(\frac{q_1}{p_m}, \frac{q_2}{p_m}\right). \tag{59}$$

We had previously stated that $H$ was descriptive of its
set of arguments but not that it was unique. Shannon [11, 12]
shows that the only function continuous in $p_i$ and possessing
properties (56) and (59) is $-A \sum_i p_i \log p_i$, where $A$ is an
arbitrary constant. Khinchin [6] shows the same result
using properties (44), (46) and (54).

For applications of this statistic the reader is referred
to the bibliography.